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Summary 

Model  determination  is  divided  into  the  issues  of  model  adequacy  and  model 
selection.  Predictive  distributions  are  used  to  address  both  issues.  TWs  seems  natural 
since,  typically,  prediction  is  a  primary  purpose  for  the  chosen  model.  A  cross-validation 
viewpoint  is  ar^ed  for.  In  particular,  for  a  given  model,  it  is  proposed  to  validate 
conditional  predictive  distributions  arising  from  single  point  ddetion  against  observed 
responses.  Sampling  based  methods  are  used  to  carry  out  required  calculations.  An 
example  investigates  the  adequacy  of  and  rather  subtle  choice  between  two  sigmoidal 
growth  models  of  the  same  dimension. 
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1.  Introdnction 

Responsible  data  analysis  must  address  the  issue  of  model  determination  which 
divides  into  two  components;  model  assessment  or  checking  and  model  choice  or  selection. 
That  is,  apart  from  rare  situations,  the  model  specification  is  never  "correct".  Rather  the 
questions  are  (i)  is  a  given  model  adequate?  (ii)  within  a  collection  of  models  under 
consideration,  which  model  is  the  best  choice?  The  questions  are  distinct.  We  need  not 
envision  a  collection  of  models  to  address  (i).  Conversely,  amongst  a  collection  of  models 
several  may  be  adequate  (or  perhaps  all  may  be  inadequate)  but  we  still  seek  the  "best" 
one.  Nonetheless  in  practice  the  questions  are  typically  investigated  concurrently.  This 
paper  adopts  a  predictive  viewpoint  for  answering  them  with  resultant  overlapping 
methodology. 

The  literature  on  model  assessment  and  model  choice  is,  by  now,  enormous.  Of 
course,  our  modeling  framework  here  is  Bayesian  whence  our  approach  to  these  problems  is 
as  well.  We  restrict  ourselves  to  parametric  models  where  the  amount  of  data  is  fixed  and 
which  are  expressible  in  the  form  Likelihood  »  Prior  where  the  Likelihood  is  an  explicit 
readily  evaluable  function  of  both  the  data  and  the  parameters  with  the  prior  a  readily 
evaluable  function  of  the  parameters. 

Model  adequacy  is  considered  with  regard  to  the  form  of  the  product.  Box  (1980)  is 
persuasive  on  behalf  of  the  predictive  stance  in  arguing  that  the  adequacy  of  a  model  can 
not  be  assessed  from  the  posterior  distribution  of  the  model  parameters.  Berger  (1985) 
observes  that  "Bayesians  have  long  used  [predictive  distributions]  to  check  assumptions". 
A  few  additional  references  are  Jeffreys  (1961),  Box  and  Tiao  (1973),  Dempster  (1975), 
Rubin  (1984)  and  Geisser  (1985). 

Turning  to  model  selection  we  need  to  clarify  our  objective.  For  a  fixed  likelihood 
the  prior  may  be  varied.  This  case  is  t3T)ically  not  viewed  as  one  of  model  selection  but 
rather  one  of  model  robustness  (see  Berger  (1984)  for  a  review).  More  recent  work  is 
summarized  in  Gelfand  and  Dey  (1991).  In  this  work  the  estimative  side  using  the 
resultant  posterior  is  usually  investigated  with  regard  to  sensitivity  to  such  variation. 
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Rather,  the  -osual  business  of  model  selection  is  the  specification  of  the  likelihood  or 
joint  distribution  of  the  data.  For  instance,  in  the  case  of  response  models  one  needs  to 
specify  the  parametric  form  of  the  error  distribution  (equivalently,  a  transformation  of  the 
response  variable),  a  parametric  form  for  the  mean  and  a  parametric  form  for  the  variance. 
(See  in  this  regard  the  recent  work  of  Carlin  and  Poison  (1991)  and  George  and  McCulloch 
(1991).  There  is  an  implicit  trade-off  hare  (Smith,  1986)  e.g.  complex  form  for  the  mean 
with  pure  Gaussian  error  versus  simple  specification  of  the  mean  with  say  a  mixture  of 
Gaussians  as  error.  Hence  judicious  choice  of  the  collection  of  models  entertained  is 
essential. 

Our  discussion  is  concerned  with  model  choice  in  this  broad  sense  emphasizing  that 
predictive  distributions  enable  arbitrary  model  comparisons.  Moreover  the  predictive 
viewpoint  is  typically  consonant  with  the  intended  use  of  the  model.  In  the  special  case 
where  models  under  consideration  are  nested  so  that  a  subset  of  components  of  the 
parameter  vector  may  be  viewed  as  a  discrepancy  parameter  measuring  departure  from  a 
baseline  model  (Box  1980)  model  choice  reduces  to  posterior  inference  for  this  subset  of 
parameters.  In  the  absence  of  nesting,  comparison  of  posteriors  between  models  typically 
conveys  little  information.  The  models  need  not  share  the  same  dimensionality.  Moreover, 
even  if  they  do,  from  one  to  another,  the  parameters  will  have  different  interpretations. 

Model  determination  is  closely  related  to  other  data  analytic  issues  such  as  residual 
analysis,  influence  measures  and  outlier  detection.  We  offer  no  elaboration  here  but  note 
the  recent  papers  of  Pettit  and  Smith  (1985),  Geisser  (1987),  Chaloner  and  Brant  (1988), 
Verdinelli  and  Wasserman  (1991)  and  Guttman  (1991). 

The  entire  proposed  enterprise  rests  upon  our  ability  to  obtain  desired  predictive 
distributions  and  to  calculate  expectations  under  these  distributions.  Analytic  evaluation 
of  the  required  integrals  is  generally  hopeless  with  effective  approximations  only  available 
for  simpler  cases.  To  accomplish  needed  integrations  we  propose  the  use  of  sampling— based 
methods  as  discussed  in  Rubin  (1988)  and  in  Gelfand  and  Smith  (1990).  Such  calculation 
is  very  computer-intensive,  particularly  when  undertaking  more  complex  and  realistic 
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modeling.  However,  the  routine  availability  of  enormous  computing  power  makes  this  no 
obstacle.  In  particular  this  implies  that  many  contending  models  may  eventually  be 
considered.  As  Geisser  (1988)  notes,  "this  could  create  havoc....  for  those  who  adhere  to 
stringent  versions  of....  Bayesian  approaches."  However  we  feel  that  the  art  of  data 
analysis  should  not  be  bound  by  overly  formal  inferential  frameworks. 

This  paper  is  essentially  a  synthesis  and  extension  of  earlier  work  in  Bayesian  model 
determination.  Our  main  contributions  are  the  fleshing  out  of  the  cross-validation 
approach  and  the  presentation  of  straightforward  computing  procedures  to  implement  such 
cross-validation.  In  Section  2  we  detail  the  proposed  predictive  approaches  for  model 
determination.  In  Section  3  we  describe  the  Monte  Carlo  techniques  for  performing  the 
required  calculations.  An  illustrative  example  is  discussed  in  Section  4  and  we  offer  brief 
conclusions  in  Section  5. 

2.  Predictive  Approaches  for  Model  Determination 
2.1  Predictive  densities 

Box  (1980)  discusses  the  complementary  roles  of  the  predictive  and  posterior 
distributions  in  Bayesian  data  analysis  noting  that  the  posterior  distribution  provides  a 
basis  for  "estimation  of  parameters  conditional  on  the  adequacy  of  the  entertained  model" 
while  the  predictive  distribution  enables  "criticism  of  the  entertained  model  in  the  light  of 
current  data."  We  concur  noting  further  that  in  comparing  models  predictive  distributions 
are  directly  comparable  while  posteriors  are  not.  The  predictive  distribution  (or  marginal 
likelihood)  is  the  joint  marginal  distribution  of  the  data.  This  distribution  may  be  used  in 
various  ways  to  examine  model  determination  questions.  Our  approach,  which  is  argued 
for  below,  is  a  cross-validation  one  and  leads  to  the  examination  of  a  collection  of 
conditional  distributions  arising  from  this  joint  distribution. 

We  use  customary  notation  in  letting  upper  case  letters  denote  random  variables  and 
lower  case  letters  denote  the  observed  realizations  of  these  variables  in  our  sample.  In 
particular,  the  observed  value  of  the  r^**  response,  Yr,  in  our  sample  is  jr.  Let  Y  denote  the 
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nxl  data  vector,  and  let  Y(r)  denote  the  n-l  »  1  data  vector  with  deleted.  Let 
denote  the  matrix  of  explanatory  variables  whose  r^**  row,  Xr  is  associated  with  yr.  Let 
X(  r)  denote  the  matrix  which  is  X  with  the  r*'^  row  deleted.  All  densities  for  the  data  will 
be  denoted  by  f,  all  densities  for  the  parameters  will  be  denoted  by  x  vnth  arguments 
providing  clarification.  More  precisely  for  a  given  model  let  0  denote  the  vector  of  model 
parameters.  Then  f(Y|  5,X)  is  the  joint  density  of  the  data  given  0  and  is  the  prior 
density  for  0  whence  f(Y|  tf,X)  •  7r(^)  is  the  model  specification.  Both  Y  and  0  are 
presumed  continuous. 

Assuming  that  the  integral  exists  the  predictive  density  is  f(Y)  =  J f(Y|  0,X)  7{0)dO. 
In  model  selection,  interest  focuses  on  f(Yj  5,X)  and  iris  often  chosen  to  be  vague.  But  if  ir 
is  improper  then  f(Y)  necessarily  is,  making  it  awkward  to  use  in  model  checking.  Note 
however,  that 


I(Vr|Y,r,)  =  ^4^  =  /t(Y,i»,Y,„,X)><«|Y,„)dO  (1) 

is  proper  since  ;r(^|Y(r))  is.  The  density  f(Yrl  ^,Y(r)  ,X)  is  immediate  if  the  Yr  are 
conditionally  independent  given  6  as  well  as,  for  example,  if  f(Y|  ^>,X)  is  multivariate 
normal. 

Suppose  that  f(Y)  is  proper  and  strictly  positive  over  its  domain.  Then  f(Y)  is 
equivalent  to  the  set  {f(Yr|  Y(r) )  :  r=l,-  •  •,n}  in  the  sense  that  each  uniquely  determines 
the  other  (Besag,  1974).  Hence,  in  terms  of  model  assessment,  examining  the  observed  y 
with  respect  to  f(Y)  is  the  same  as  with  respect  to  the  set  of  f(Yr|  Y(r)).  It  may  be  easier 
to  work  with  the  latter  distributions  since  each  is  univariate. 

We  briefly  argue  that,  in  checking  against  the  observed  yr,  f(Yr|Y(r))  is  the 
preferred  univariate  predictive  distribution  for  Yr.  If  f(Y)  is  proper  we  could  consider  the 
univariate  marginal  f(Yr).  Of  course  the  f(Yr)  do  not  determine  f(Y)  but  more  importantly 
f( Yr)  ignores  the  remaining  observations,  y ( rj  •  In  practice,  were  we  to  attempt  prediction 
of  a  new,  not  necessarily  independent,  Yr  at  Xr  we  would  use  a  posterior  distribution  for  $ 
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in  creating  the  desired  predictive  distribution.  In  assessing  the  model  this  seems 
appropriate  as  well;  we  should  check  how  well  the  model  predicts  in  the  manner  in  which 
we  would  use  it  to  predict.  But  then  should  we  use  f(Yr|7)  or  f(Yr|y(r) )?  The  former  is 
used  for  prediction  at  a  new  vector  say  Xo  but  for  validation  at  an  Xr  for  which  we  have 
already  observed  yr  the  latter  seems  preferable.  That  is,  if  we  propose  to  check  the 
predictive  distribution  for  Yr  against  an  observed  yr  we  should  not  use  that  yr  to  determine 
this  distribution.  Note  that  f(Yriy(r))  may  be  quite  different  from  f(Yr|y)  even  in  the 
case  that  the  Yr  are  conditionally  independent  given  A 

Such  cros:  validation  is  well  established  in  the  Bayesian  literature  dating  at  least  to 
Stone  (1974)  and  Geisser  (1975).  Frequentist  model  diagnostic  approaches  adopt  a  similar 
point  of  view  (see  e.g.  Belsley,  Kuh  and  Welsch,  1980  or  Cook  and  Weisberg,  1982).  Cross 
validation  schemes  other  than  single  point  deletion  may  be  helpful  and  will  share  the  same 
advantages  described  above.  However  in  the  sequel  we  use  f(Yr  1  yt  r) )  exclusively. 

2.2  Model  adequacy 

The  predictive  distributions,  f(Yr|y(r)  )>  are  to  be  checked  against  yr  for  r=l,2,‘  •  ‘U 
in  the  sense  that,  if  the  model  holds,  yr  may  be  viewed  as  a  random  observation  from 
f(Yrly(r))-  To  do  this  we  consider  g(Yr;  yr).  called  a  checking  function  by  Box  (1980), 
whose  expectation  under  f(Yr|y(  r) )  we  will  calculate  and  denote  by  dr.  The  set  of  dr  will 
be  used  for  model  assessment.  Computation  of  the  dr  is  discussed  in  Section  3.  Once 
obtained  the  approach  is  exploratory.  In  fact,  since  each  dr  is  a  function  of  the  entire  data 
vector  Y  they  will  be  strongly  dependent  making  formal  inference  very  difficult.  Our 
strategy  is  a  Bayesian  analogue  to  well  accepted  frequentist  strategy  of  examining 
studentized  residuals,  DFFITS,  DFBETAS  etc.,  (again  see  Belsley,  Kuh  and  Welsch,  1979 
or  Cook  and  Weisberg,  1982). 

We  will  look  at  several  choices  of  g.  For  example 
(i)  gi(Yr;  yr)  =  yr  -  Yr  yielding  dir  =  7t  -  fh  where  fh  =  E(Yr|y( r)  )•  The  dir  are 

natural  deviations  or  residuals  mentioned  in  Geisser  (1987,  p.  138).  With 
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ff j  =  var  (Yrly(r)),  standardizing  yields  d'^  =  dir/^r.  The  quantity  S(d'jj) 
could  be  used  as  an  index  of  model  fit.  Many  large  |d'jj.l  cast  doubt  upon  the 
model  but  retaining  the  sign  of  djj.  allows  patterns  of  under  or  over  fitting  to  be 
revealed.  If  the  f(Yr|y(r))  are  assumed  approximately  normal  then  a  normal 
plot  of  the  d'jj.  may  be  informative  as  well. 

(ii)  g2(Yr:  yr)  =  lAr(Yr)  where  Ar  =  (-»,  yr]  yielding  djr  =  P(Yr  <  yrlycr))- 

Viewing  yr  as  a  random  observation  firom  f(Yr|y(r))  implies  dar  ~  U(0,1). 
Because  of  the  dependence  amongst  the  djr  it  would  be  wrong  to  expect  them  to 
exhibit  the  spread  associated  with  independent  uniform  samples.  Nonetheless  an 
adequate  model  should  manifest  djr  which  are  roughly  centered  about  .5  without 
many  extreme  values.  Evidence  to  the  contrary  calls  the  model  to  question. 

(iii)  g3(Yr:  yr)  =  le/Yr)  where  Br  =  {Yr  :  f(Yr|y(r))  <  f(yr|y(r))}  }de!disg  d^r  = 

P(Br|y(  r) ).  Again  viewing  yr  as  a  random  observation  firom  f(Yr|y(r) )  implies 
f(yr|y(r))  is,  itself,  a  random  realization  of  f(Yr|y(r))  whence  dsr  ~  U(0,1). 
Again  the  dsr  should  be  roughly  centered  about  .5  without  many  extreme  values. 
The  dsr,  adapted  from  Box  (1980)  also  appear  in  Geisser  (1987).  Note  that  Yr 
need  not  be  univariate  in  this  definition;  g3  may  be  used  for  other 
cross-validation  schemes.  In  fact  Box  proposed  use  of  gs  to  assess  the  entire 
joint  predictive  distribution.  Unfortunately,  calculation  of  this  multivariate 
probability  will  generally  be  difficult.  This  same  measure  is  referred  to  as  the 
surprise  index  in  Aitchison  and  Dunsmore  (1975). 

Assuming  that  Yr  is  univariate  and  that  f(Yr|y(r) )  is  unimodal,  the  dsr  calculate 
a  set  of  tail  areas.  If  we  further  assume  that  f(YrIy(r))  is  approximately  a 
normal  density,  then  the  event  Bp  is  approximately  the  event  {(Yr-/ir)^/<^T  - 
(d'lr)  }•  Thus  d3r  »  P(Xi  ^d'lr)  )  relating  dir  and  d  3r.  In  retaining  the  sign  of 
the  deviation,  du-  is  preferable  to  the  induced  dsr- 

(iv)  gg(Yr;  yr)  =  (2c)‘‘  lc^(t)(Yr)  where  Cr(c)  =  {Yr :  yr  -  «  <  Yr  <  yr  +  «}  yielding 
d4r(«)  =  (2e)'‘  P(Cr(e)|y(r))-  To  avoid  specification  of  e  we  take  the  limit  as 
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e 0  obtaining  d4r  =  f(yr|y(r))-  This  quantity  dates  at  least  to  Geisser  and 

Eddy  (1979)  and  is  computed  in  the  definition  of  Br.  In  the  case  of  conditionally 

n 

independent  Yr  given  0,  they  suggest  the  use  of  II  d4r  as  a  modification  of  f(y) 

r=l 

n 

for  assessing  comparative  validity  of  models.  We  might  call  II  d4r  a 

r=l 

pseudo— predictive  distribution  or  pseudo-marginal  likelihood. 

Many  small  d4r  criticize  the  model  but  it  may  not  be  obvious  what  a  small  d4r  is. 
Following  an  idea  of  Berger  (1985,  p  201)  we  might  instead  consider  a  relative 
likelihood  leading  to  a  modified  d4r  such  as  d^j  =  d4r/sup  f(yly(r))  or  d"  = 

y 

^<r/E(f(Yr|y(  r)  )  lyt  r)  )■ 

2.3  Model  Choice 

The  standard  Bayesian  approach  for  model  selection  goes  as  follows.  Suppose  there 
are  J  proposed  models  with  model  Mj  denoted  as  f(Y|  Oy,  X,  Mj)  •  5r(^j).  If  wj  denotes  the 
prior  probability  of  Mj  then,  by  Bayes  theorem,  the  posterior  probability  of  Mj  is 

J 

p(Mj|  Y)  =  f(YlMj)  -  Wj/  E  f(Y|Mj)  •  wj  (2) 

j=l 

where  f(Y|Mj)  is  the  predictive  or  joint  marginal  distribution  of  the  data  under  model  Mj. 
For  observed  y  the  model  jdelding  the  largest  p(Mj|y)  is  selected.  Calculation  of  (2)  is 
discussed  in  Section  3.  Geisser  and  Eddy  (1979)  suggest  a  aoss  validation  version 
replacing  f(YlMj)  in  (2)  by  the  pseudo-predictive  distribution. 

There  is  a  fundamental  complication  engendered  in  this  formalism  which  was 
recognized  as  early  as  Bartlett  (1957)  and  elegantly  clarified  by  Pericchi  (1984).  Some 
models  are  implicitly  disadvantaged  relative  to  others  using  this  approach  even  under  a 
state  of  presumed  indifference  towards  the  models  i.e.  wj  =  1/J,  j=l,*  •  -  .J.  Section  2.3.1 
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further  investigates  this  complication  including  an  extension  of  Pericchi’s  remedy.  An 
alternative  remedy  is  considered  in  Section  2.3.2  also  using  ctoss— validation  ideas. 

Another  criticism  is  that,  in  most  practical  situations,  we  doubt  that  anyone 
including  Bayesians  would  select  models  in  this  fashion.  One  doesn’t  really  believe  that 
any  of  the  proposed  models  are  correct  whence  attaching  a  prior  probability  to  an 
individual  model’s  correctness  seems  silly.  The  selection  process  is  typically  evolutionary 
with  comparisons  often  made  in  pairs  until  a  satisfactory  choice  (in  terms  of  both 
parsimony  and  performance)  is  made.  Such  pairwise  decisions  would  be  made  using  the 
Bayes  factor,  f(Y|  Mi)/f(Yl  Mj).  But  if  at  least  one  of  the  is  vague  interpretation  of 
this  factor  is  problematic.  A  possible  remedy  is  suggested  in  Spiegelhalter  and  Smith 
(1982)  using  a  reserved  or  imaginary  training  data  set.  In  Section  2.3.3  we  suggest  simpler 
validation  criteria  based  on  the  ideas  in  Section  2.2  and  in  the  spirit  of  Box  (1980,  p.  427). 

2.3.1  Nentralmng  differential  expected  increase  in  information 

A  simple  example  due  to  Bartlett  (1957)  reveals  a  difficulty  with  the  Bayes  factor 
and  the  standard  procedure.  Suppose  under  Model  1,  that  Yi,«  ••,¥!,  are  i.i.d.  N(0,1) 
while,  under  Model  2,  Yi,-  •  -  .Yn  are  i.i.d.  N(^,l)  with  6  «  N(0,r'^).  Then  regardless  of  the 
data  Y,  of  n,  and  of  wj,  as  m,  f(Y|Mi)/f(Y|M2)  -»  od  and  p(Mi|Y)  -*  1.  This  example 
was  extended  to  more  general  nested  normal  models  in  Smith  and  Spiegelhalter  (1980). 
Pericchi  (1984)  identified  the  source  of  the  complication:  for  a  given  experiment  the 
expected  increase  in  information  about  the  model  parameters  varies  with  the  specification 
of  the  model.  His  remedy  is  to  weight  the  Bayes  factor  or  to  revise  the  prior  probabilities 
Wj  to  achieve  neutral  discrimination  with  regard  to  what  is  expected  to  be  learned  about  0^ 
under  model  Mj. 

In  particular,  using  the  usual  information  entropy  measure,  the  information  in  the 
prior  is  -  J*  log7r(^  d^  (making  the  rather  strong  assumption  that  this  integral  exists), 

the  information  in  the  posterior  is  -J  x(tf|Y)  log  x(tf|Y)  d^  whence  the  expected  increase 
in  information  about  $  from  the  experiment  (Lindley,  1956)  is 
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I(f,x)  =  f{fii0\Y)  log  Y)  f(Y)  dY  -/ 7r{0)  log  Tri(f)  dB. 

For  two  models  with  I(fi,7ri),  l{U,‘ir2)  Pericchi  (1984)  proposes  revision  of  wi  to 
w'^  =  exp(I(fi,xi))/{exp(I(fi,Ti))  +  exp(I(f2,T2))}.  Equivalently  the  Bayes  factor  would  be 
multiplied  by  p  =  Wj/l-wj). 

Under  the  linear  model  Y  =  X0  +  c  with  X  ,  t  ~  N(0,o-^I),  known  and 

n*p, 

9  ~  N(p^  follows  from  Stone  (1959)  that 

I(f,ir)=^log(|I  +  V^X’'Xi).  (3) 


For  the  example  in  Section  4  we  propose  model  choice  between  two  nonlinear  models  with 
normal  errors.  But  if  we  replace  the  mean  of  Yr,  XrB,  by  no  longer  has  the 

closed  form  (3).  However  a  first  order  approximation  may  be  readily  obtained.  Assuming 
dipjdOi  exists  for  ail  i=l,2,  •  •  •  ,p  we  may  write 

P 

^Xr;^0=  ¥<Xr;<?o)+ S  (^i  -  M  • — 

i=l  dB\  Bo 


so  that  Y'  »  E  ari(X)-^i  +  e  where  Y'  =  Yr  -  if{Xr]9o)  -  Sari(X)-^oi  and 
dv{y^r,0) 

ari(X)  = -  .  Hence  using  (3)  I(f,T)  »  j  log(|I  +  A  A|)  where  A  is  an  n*p 


dOi  \Bo 

matrix  such  that  Ari  =  ari(X).  A  practical  choice  for  Bo  might  be  the  MLE.  For  two 


nonlinear  models  the  resultant  weight  p  =  {|I  +  Vj^^A^Ai|/|I  +  V^^A2A2|}^.  We  may 
pass  to  noninformative  prior  specifications  by  setting  =  V  and  letting  V  *  0 

whence  p  =  {|  AjAi|/|A2A2|}^. 


2.3.2  A  maximum  expected  utility  approach 

An  alternative  remedy  modifies  examination  of  (2)  by  formulating  the  problem  of 
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model  choice  as  one  of  maximizing  expected  utility.  Several  authors  (Box  and  Hill,  1967; 
San  Martini  and  Spezzaferri,  1984;  Poskitt  1987)  discuss  such  an  approach.  The  crucial 
concept  is  the  introduction  of  a  utility  functional  to  capture  "the  utility  of  a  model  given 
the  data".  Utility  structures  incorporating  posterior  distributions  as  an  argument  wiU  have 
limited  appbcability  for  model  choice  since  the  parameter  vector  may  be  interpreted 
differently  from  model  to  model.  Use  of  predictive  distributions  avoids  this  problem.  San 
Martini  and  Spezzaferri  (1984)  take  the  utility  of  the  predictive  distribution  at  a  future 
unobserved  Yq,  U(f(Yo|Y),  yo),  where  yo  is  the  true  unknown  future  value.  From  an 
argument  in  Bernardo  (1979)  they  recognize  that  the  unique  proper  local  utility  function 
has  the  form  bo  f(yo|  Y)  +  bi(yo). 

In  choosing  between  two  models  the  expected  utility  solution  is  to  select  the  model 
yielding  the  larger  expected  utility.  It  turns  out  that,  regardless  of  bo  and  bi(yo),  we 
choose  Mi(M2)if  wj  K''f(Yo|y,Mi),  f(Yo|y,M2))  >  (<)  W2  K(f(Yo|y,M2),  f(Yo|y,Mi)) 
where  K(fi,f2)  =  fU  denotes  the  KuUback-Leibler  divergence  between  the 

densities  fj  and  £2.  This  criterion  is  appealing  since  K(fi,f2)  is  interpreted  as  the  expected 
or  average  information  for  discriminating  in  favor  of  fi  against  {2.  In  the  cross  validation 

context  we  replace  K(f(Yol  Y,Mj),  f(YolY,Mj'))  with  2  K(f(Yrly(r)  ,Mj),f(Yrly(r)  ,Mj')). 

r=l 

This  substitution  arises  by  replacing  f(Yo|y,Mj)  with  the  pseudo-predictive  distribution, 
n 

D  f(Yr|y(  r)  ,Mj),  as  discussed  in  (iv)  of  Section  2.2.  An  alternate  form  is 
r=l 


choose  Mi(M2)  if  Ep 


r  n  {{Yr\7iT)Mih 

log - 

n  f(Y,|yu),M2)J 


>(<)0 


(4) 


where  f*  =  will  f(Yr|y(r)  ,Mi)  +  W2II  f(Yr|y(r)  .M?)- 

Calculation  of  the  Kullback— Leibler  divergences  is  discussed  in  Section  3.  Other 
information  measures  (see  e.g.  Csiszar,  1977)  could  be  investigated  as  well.  The  expected 
utility  approach  is  readily  extended  to  J  >  2  models. 
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2.3.3  Ad  hoc  procedures 

Each  of  the  criteria  developed  in  Section  2.2  may  be  converted  to  an  ad  hoc  model 
choice  procedure.  Given  two  models,  for  k=l,  2,  3,  4  associate  dkr{Mj)  with  model  Mj, 

j=l,  2. 

For  k  =  1  choose  Mj  with  the  smaller  value  of  Djj  =  E(dJj.(Mj)) 

For  k  =  2,  3  choose  Mj  with  the  smaller  value  of  Dkj  =  E(dkr(Mj)  -  .5)^ 

For  k  =  4  choose  Mj  with  the  larger  value  of  IId4r(Mj).  Equivalently 

rnd4r(M,)l 

choose  Mi(M2)  according  to  D4  =  log -  >  (<)0  (5) 

nd4r(M2). 

Expression  (5)  may  be  directly  compared  with  (4).  For  the  latter  we  calculate  expected 
utilities;  for  the  former  we  calculate  observed  utilities.  In  fact  exp(D4)  is  a  pseudo-Bayes 
factor.  Given  that  the  dkr(Mj)  will  have  already  been  calculated  for  use  in  model 
assessment  the  additional  computation  for  their  use  in  these  ad  hoc  model  choice 
procedures  is  negligible. 

3.  Computational  Approaches 

We  propose  the  use  of  sampling— based  methodology  to  calculate  the  various  objects 
of  interest  in  Section  2.  Monte  Carlo  techniques  have  significantly  advanced  our  ability  to 
carry  out  integrations  required  for  Bayesian  inference.  The  literature  for  noniterative 
methods  is  substantial.  We  mention  here  the  recent  papers  of  Rubin  (1988),  Geweke 
(1989)  and  West  (1990).  The  paper  of  Geweke  provides  many  further  references.  Iterative 
approaches  are  discussed  in  Tanner  and  Wong  (1987)  and  in  Gelfand  and  Smith  (1990). 

3.1  Monte  Carlo  estimates  of  the  dr 

For  a  given  model,  computational  effort  focuses  on  the  calculation  of  the  dr  = 
E(g(Yr;yr)|y(r))  =  / /  g(Yr;yr)f(Yr|^,y(r)  ,X)-ir(0|y,r))  dtfdYr. 
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If  {0s,Yts),  s  =  are  samples  frcm  the  joint  conditional  distribution  for  0  and 

Yr,  f(Yr|^,y(r),X)*7r{0|y(r))  then  a  Monte  Carlo  approximation  for  dr  is 
B 

dr  =  B  S  g(Yrs;yr).  Sampling  from  f(Yr|  tf,y(r)  ,X)  is  usually  no  problem;  sampling 
s=l 


from  7r(  0\  j(  r) )  is.  We  return  to  this  matter  shortly. 

If  £r{»  ;y)  =  /g(Yr  ;  yr)  f(Y,|«,y,„,X)  dY,  then  d,  =  E(£,(d  ;y)|y,ri),  “ 

expectation  with  respect  to  the  posterior  x(5|y(r)).  In  certain  cases  £^0  ;y)  can  be 

B 

calculated  explicitly  whence  dr  =  B  E  ^r(^s;y)-  We  need  not  draw  the  sample  of  Yrs 

s=l 


Such  savings  in  random  variate  generation  is  referred  to  as  streamlining  in  Rubin  (1988) 
and  in  Gelfand  and  Smith  (1990).  In  fact  the  estimate  of  the  predictive  density  itself, 

f(Yr|y(r)),  requires  only  the  tfs,  i-e.,  f(Yr|y(r))  =  B  E  f(Yr| y(r) ,X). 

s=l 

If  h(^  is  an  importance  sampling  density  for  x(^|y(r) )  and  s  =  1,*  •  ^B  are  drawn 

B 

from  h,  the  above  Monte  Carlo  estimates  are  modified  to  dr  =  S  g(Yrs,yr)  'Vrs,  or 

S=:l 

B  .  B 

dr=  E  5r(^s:y)-Vrs  and  f(Yr|y(r))  =  s  f(Yr|^s,y(r),X)-Vni  respectively  where 
s=l  s=l 


Vrs  = 


B 

S  7r(^s|y(r))/h(5s) 

S=1 


-1 


•7r(t?s|y(r))/h(^s)-  As  a  related  remark,  if,  for  example, 

f(Yr|  ^,y(r)  ,X)  is  a  normal  distribution  then  f(Yr|y(r))  is  a  finite  mixture  of  normals. 
Theory  developed  in  Johnson  and  Geisser  (1983)  shows  that  in  such  situations  f(Yr|y(r) )  is 
exactly  or  approximately  a  t— distribution.  But  t— distributions  arise  as  scale  mixtures  of 
normal  distributions  which  can  be  arbitrarily  weU  approximated  by  a  finite  mixture  of 
normals. 

Note  that  7r(<?|y(r))  «  Ti^/KvA  ^,y(r),X)  where  r{0)  =  {{j\0,X)’j{ff)  so  that 


r(l?s)/(h(gs)-f(7r|g,y(r,.X)) 

i  r(y5)/(h((I,).f(y,|y,y,„.X)) 
s=l 
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Rather  than  develop  an  h(0)  for  each  5r(0jy(r))  it  would  be  more  efficient  to  find  a  simple 
choice  which  we  could  sample  and  then  use  for  all  r.  The  form  of  Vrs  suggests  h(fl)  «  r{S) 
i.e.  h(fl')  =  7r(01y)  would  be  a  natural  choice.  We  recall  that  the  Gibbs  sampler,  as 
described  in  Gelfand  and  Smith  (1990)  for  application  to  hierarchical  Bayes  models, 
produces  observations  essentially  from  the  joint  posterior  7r(^ly).  Hence,  if  the  Gibbs 
sampler  is  used  to  carry  out  Bayesian  inference  under  the  given  model,  the  outputted  Os 
can  be  used  directly  as  input  to  carry  out  computations  needed  for  studying  model 
adequacy  and  model  choice.  Implementation  of  the  Gibbs  sampler  for  challenging  models 
will  require  tailored  versions  of  the  rejection  method.  See  Carlin  and  Gelfand  (1991),  Gilks 
and  Wild  (1991),  Ritter  and  Tanner  (1991).  If  a  noniterative  approach  has  been  employed 
resulting  in  an  importance  sampling  density  h(^  for  T{ff)  then  the  samples  from  h(^  can  as 
well  be  used  directly  in  the  above  formulas.  There  is  considerable  literature  on  the  creation 
of  a  good  importance  sampling  density.  In  particular  we  note  the  recent  work  of  Geweke 
(1989)  and  West  (1990). 

For  model  choice  additional  calculations  we  may  wish  to  make  are  the 
Kullback— Leibler  divergences  K(f(Yrly(  f{Yrly,  r),Mj')).  Since  these  are 

expectations  with  respect  to  f(Yr|y(r)  ,Mj),  in  principle  they  can  be  handled  in  the  same 
way  as  calculation  oi  the  dr(Mj).  However,  in  practice,  the  calculation  requires  enormous 
storage,  even  if  n  is  small,  since  each  f  is  itself  a  Monte  Carlo  estimate  and  since  these 
estimated  f’s  are  created  under  different  models  but  must  be  merged  to  calculate  K. 

The  standard  approach  to  model  choice  requires  updating  of  wj  to  p(Mj|y)  as  in  (2). 

Except  in  simple  cases  the  marginalization  to  obtain  f(Y|Mj)  is  not  available  in  closed 

form.  Noniterative  Monte  Carlo  integration  may  be  employed  as  follows.  If  x(®i)  proper 

B 

and  (?js,  s  =  1,  -  •  •  ,B  are  a  sample,  f(Y(Mj)  =  B'^  E  f(y|  ^jsiX).  If  t  (O^)  is  improper 

s=l 

but  h(0j)  is  an  importance  sampling  density  for  T(fi^),  t  defined  above,  then  f(YlMj)  = 
.  B 

S=1 
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Interestingly,  the  Gibbs  sampler  is  less  attractive  here.  By  itself,  it  does  not  readily 
produce  an  estimator  of  f(Y|Mj).  For  a  collection  of  J  models  it  need  not  uniquely  provide 
the  posterior  probabilities  p(Mj|Y)  since  Markovian  updating  using  the  llj,  j  =  1, 
and  a  variable  M  to  label  the  model  may  violate  conditions  for  convergence  (see  e.g., 
George  and  McCulloch,  1991). 

3.2  Simplified  sampling  for  nonlinear  normal  modds 

Suppose  the  model  Yr  =  ^Xr;  4-  Cr.  r  =  !>•••>  n  where  the  vector  of  errors,  e  ~ 
N(0,  a^-W),  W  known  positive  definite.  With  6=  (^,  a^)  suppose  t{0)  =  xi(^  •  T2(cr^) 
where  T2(a^)  is  inverse  gamma  IG(a,b)  i.e.  T2(cr^)  «  exp(-b/o-^)/((7^)^*‘.  We  allow  the 
improper  limiting  cases  as  a-tO  and  as  b-tO.  Then  7r(d|Y)  =  ti(^|Y)  •  t2(<7^1^,Y)  where 
7r2(<r^|AY)  is  IG(a',bO  with  a'  =  a  +  n/2,  b'  =  b  +  ^  (Y  -  W'^(Y-  y>),  y) the  vector 

of  <KXr;  0)  and  Y)  «  7r,(/3)/(bO^'. 

Suppose,  after  transformation  of  0  to  domain  R*',  that  a  noninformative  prior  is  taken 
for  0  i.e.  iri{0)  =  1  (as  in  e.g.  Johnson  and  Geisser,  1983,  p.  138).  If  g(Xr;  0}  =  Xt0  then, 
as  is  well  known  (see,  e.g..  Box  and  Tiao  1973,  p.  117),  2ri(^|Y)  is  exactly  a  multivariate 
student  t— distribution  and  sampling— based  approaches  are  not  heeded.  For  the  nonlinear 
case  let  (p{0)  =  (Y  -  ^)^W  ^(Y-^)  and  let  0  be  the  MLE  for  0  whence  <l>{0)  is  the  error  or 
residual  sum  of  squares  for  the  model.  Assuming  derivatives  exist,  to  a  second  order 
approximation,  <p{0)  »  (p{0)  +  ^0  -  0)^  ^{0-0)  where  H  has  entries  Hut  =  ^^d0t  ‘0  ‘  ^ 
is,  of  course,  proportional  to  the  inverse  of  the  sample  Fisher  information  matrix.  At  the 
least  standard  nonlinear  regression  software  handles  the  independence  case  (W  =  I)  and 

routinely  provides  0,  ^(^),  and  (H*)  ^  =  2a^H'*,  the  estimated  as3rmptotic 

covariance  matrix  of  0.  In  iri(^|Y),  replacing  (f){0)  by  this  approximation  again  yields  a 
multivariate  student  t-distribution,  say  t{0). 

For  noniterative  Monte  Carlo  we  immediately  have  a  promising  importance  sampling 
density  for  x((?|Y)  namely  t{0)  •  'k^{&^\0,Y).  The  work  of  Van  Dijk  and  Kloek  (1985), 
Geweke  (1989)  and  West  (1990)  suggests  refinements  to  X(0).  Simplification  occurs  for  the 
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Gibbs  sampler  as  well  since  it  may  be  applied  directly  to  ti(^1  Y)  using  X{0)  as  described  in 
Carlin  and  Gelfand  (1991).  The  resulting  Gibbs  replicates  say  Ps  would  then  be  used  to 
sample  from  T2(a  |  AY)  to  obtain  Bg.  The  illustrative  example  in  Section  4  is  handled 
using  noniterative  Monte  Carlo.  Other  models  may  admit  similar  conjugades  which 
ameliorate  the  computing  burden. 

4.  An  iUnstiative  example 

Our  example  compares  two  sigmoidal  growth  curve  models  of  the  same  dimension. 
Consider  the  following  data  (Ratkowsky,  1983,  p.  88)  recording  as  Y  the  dry  weight  of 
onion  bulbs  plus  tops  versus  growing  time  X. 

X123456789 

Y  16.08  33.83  65.80  97.20  191.55  326.20  386.87  520.53  590.03 

X  10  11  12  13  14  15 

Y  651.92  724.93  699.56  689.86  637.56  717.41 


The  data  is  plotted  in  Figure  1  suggesting  sigmoidal  behavior.  We  propose  to  investigate 
model  adequacy  for  and  choice  between  a  logistic  model,  Yr  =  Po{l  +  P\l^)'^  +  fr  and  a 

-3  3^^  2 

Gompertz  model,  Yr  =  Po  +  Cr  where  in  either  case  we  assume  the  Cr  iid  N(0,a  ), 

r  =  1,2,*  •  *15.  Under  either  model  Po  is  interpreted  as  an  asymptote  while  P2  €  (0,1).  We 

take  P\>  Qio  yield  an  increasing  function  of  X.  In  both  cases  we  reparametrize  to  R  by 

setting  p\  =  log  Pu  P\  =  log  {P^Hl-P^))  and  then  taking  the  prior  r{Po,  P\,  P^,  <^^)  = 

(a^)'*.  In  the  notation  of  the  previous  section  ti(A,  Ap  P'2)  =  1  ’r2(n^)  =  IG(0,0). 

The  results  of  a  standard  nonlinear  regression  fitting  package  (SAS  PROC  NLIN)  for 
each  model  are  given  in  Table  1.  These  estimates  were  used  to  obtain,  for  each  model,  a 
multivariate-t  distribution  which  was  then  used  as  an  importance  sampling  density  for  the 
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noniterative  Monte  Carlo  approach  described  in  Section  3.2  with  B  =  2000.  Table  2 
provides  the  predictive  means,  E( Yr|  yj  r) ),  and  the  dir  for  each  model. 

Table  2  reveals  that  Xr  =  14  and,  to  a  lesser  extent,  Xr  =  11  are  troublesome  points 
under  both  models.  Plots  of  djr  vs  Xr  and  d4r  vs  Xr  for  both  models  reveal  no  systematic 
patterns.  For  model  1  (logistic)  d2  =  .4978,  ds  =  .6076;  for  model  2  (Gompertz) 
d2  =  .5742,  da  =  .5997.  For  illustration.  Figure  2  presents  boxplots  of  d2r  --5  for  each 
model.  Turning  to  the  criteria  of  Section  2.3.3  we  have  Du  =  18.27,  Di2  =  16.82; 
D21  =  .9219,  D22  =  1.3160;  D31  =  1.5657,  D32  =  1.9487  and  D4  =  1.6863.  All  told,  both 
models  seem  to  provide  adequate  fit  with  model  1  being  preferable. 

5.  Conclusions 

The  predictive  techniques  proposed  here  for  model  checking  and  model  choice  are 
self-contained  with  respect  to  the  experiment,  accommodate  both  proper  and  improper 
priors,  employ  only  univariate  distributions  and,  using  sampling-based  methods  are  readily 
computed.  The  Monte  Carlo  technology  described  here  can  be  straightforwardly  modified 
for  use  in  other  predictivist  enterprise  such  as  prediction  of  future  observations,  diagnostics 
for  outlier/influential  point  detection  (Johnson  and  Geisser,  1982,  1983)  and  optimal 
combination  of  models  for  prediction  (Min  and  Zellner,  1990). 

Methodology  for  effective  model  assessment  and  selection  is  available  and 
implementable.  As  the  art  of  Bayesian  data  analysis  evolves  and  more  challenging 
problems  are  tackled,  judicious  use  of  this  methodology  should  become  a  standard 
component  of  the  data  analysis  process. 
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Table  1:  Maximum  Likeliliood  Estimation  for  the 
Two  Sigmoidal  Growth  Curve  Modds  of  Section  4. 
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Logistic  Model: 

^0  =  702.876  I3\  =  4.454 

(t>Cp)  =  8913.991 


■  193.741 

-1.107 

-0.872  ■ 

-1.017 

0.058 

0.026 

-0.872 

0.026 

0.0133 

J 

Gompertz  Modd: 

h  =  723.059  P\  =  2.502 

<t)0)  =  13616.000 


■  486.053 

-3.842 

2.311  ■ 

# 

t 

II 

-3.842 

0.0813 

0.039 

2.311 

0.039 

0.020 

=  -0.008 

=  742.833 


=  0.564 
7  =  1134.667 
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Table  2:  Predictive  means  and  dir  for  Models  1  and  2 


Xr  yr 


Model  1 

E(Yr|y(r))  djj.  dir  dsr 


1  16.08 

2  33.83 

3  65.80 

4  97.20 

5  191.55 

6  326.20 

7  386.87 

8  520.53 

9  590.03 

10  651.92 

11  724.93 

12  699.56 

13  689.96 

14  637.56 

15  717.41 


15.88 

31.00 

59.19 

110.26 

188.42 
286.56 
429.51 
524.68 
602.30 

647.43 

665.16 

687.16 
697.74 
708.65 
698.73 


0.0066 

0.0965 

0.2239 

-0.4347 

0.0967 

1.1858 

-1.3458 

-0.1244 

-0.3913 

0.1433 

2.2774 

0.3931 

-0.2425 

-2.9761 

0.5913 


0.3846 

0.6125 

0.5091 

0.3447 

0.5976 

0.8257 

0.0963 

0.4985 

0.3181 

0.5822 

0.9832 

0.6559 

0.3779 

0.0012 

0.6784 


0.9999 

0.7947 

0.8711 

0.5418 

0.8754 

0.3467 

0.2663 

0.8740 

0.6780 

0.8910 

0.0210 

0.6893 

0.6734 

0.0067 

0.5843 


Model  2 


Xr 

yr 

E(Yr|y(r,) 

du 

djr 

d3r 

1 

16.08 

0.38 

0.4745 

0.8105 

0.4650 

2 

33.83 

5.40 

0.8779 

0.4938 

0.6854 

3 

65.80 

30.25 

1.1041 

0.9247 

0.1677 

4 

97.20 

97.22 

-0.0007 

0.6426 

0.9840 

5 

191.55 

202.20 

-0.2702 

0.4809 

0.9438 

6 

326.20 

314.97 

0.2897 

0.7406 

0.8339 

7 

386.87 

436.20 

-1.3458 

0.0698 

0.1539 

8 

520.53 

516.31 

0.1157 

0.4172 

0.9202 

9 

590.03 

583.19 

0.1916 

0.7452 

0.8931 

10 

651.92 

628.85 

0.6543 

0.8718 

0.6351 

11 

724.93 

655.60 

2.1501 

0.9908 

0.0207 

12 

699.56 

682.69 

0.4533 

0.5509 

0.7938 

13 

689.96 

701.02 

-0.2998 

0.2456 

0.5169 

14 

637.56 

717.28 

-2.6886 

0.0034 

0.0038 

15 

717.41 

713.76 

0.0947 

0.6243 

0.9778 

d4r 

0.0364 

0.0361 

0.0349 

0.0314 

0.0331 

0.0143 

0.0110 

0.0314 

0.0304 

0.0336 

0.0028 

0.0304 

0.0323 

0.0006 

0.0266 


d4r 

0.0280 

0.0205 

0.0158 

0.0299 

0.0275 

0.0258 

0.0107 

0.0289 

0.0292 

0.0234 

0.0026 

0.0265 

0.0276 

0.0008 

0.0287 


Y 


750+ 


500+ 


250+ 


* 


* 


* 


0+  * 


*  * 


■+ 


2.5 


* 


+ 


5.0 


■+ 


7.5 


10.0 


12. 


Figure  1:  Plot  of  onion  bulb  data 
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Discnssioit 

Adrian  E.  Raftery  (Univeisity  of  Washington) 

1.  Introduction  and  summary 

It  is  a  pleasure  to  congratulate  the  authors  on  an  interesting  and  important  paper 
that  points  out  how  sampling— based  methods  can  make  Bayesian  diagnostics  for  model 
checking  routinely  available.  Bayesian  diagnostics  are  often  similar  to  frequentist  ones,  but 
they  have  the  great  advantage  of  being  systematically  available  through  the  predictive 
distribution,  even  for  complex  models.  This  is  in  contrast  with  frequentist  diagnostics, 
which  have  to  be  developed  from  scratch  for  each  new  class  of  models,  often  requiring 
considerable  ingenuity.  The  interpTetation  of  Bayesian  diagnostics  is  somewhat  glossed 
over  by  the  authors,  however. 

We  part  company  to  some  extent  on  the  issue  of  model  choice.  I  am  unconvinced  by 
the  arguments  against  the  standard  Bayesian  procedure,  namely  that  based  on  posterior 
model  probabilities.  New  results  indicate  that  posterior  model  probabilities  can  be  readily 
computed  using  sampling— based  methods.  Also,  the  standard  Bayesian  prf'cedure  w  based 
on  predictive  distributions,  in  a  prequential  rather  than  a  cross-validation  sense. 

2.  Bayesian  diagnostics  for  model  checking 

A  real  achievement  of  this  paper  is  to  show  how  sampling— based  methods  can  be  used  to 
obtain  Bayesian  diagnostics  systematically  and  routinely  for  a  very  wide  class  of  models. 
When  frequentist  diagnostics  are  available  they  are  often  similar  to  Bayesian  diagnostics. 
The  great  advantage  of  Bayesian  diagnostics  is  that  they  are  available  quite  generally  from 
the  predictive  distribution,  unlike  their  frequentist  counterparts,  which  can  require 
considerable  ingenuity  for  each  new  class  of  models. 

The  authors  have,  however,  rather  glossed  over  the  interpTetation  of  their  diagnostics. 
For  example,  in  he  nonlinear  regression  example,  they  conclude  that  points  11  and  14  are 
troublesome  but  that,  all  told,  both  models  provide  an  adequate  fit.  What  is  the  basis  for 
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this  conclusion?  Nothing  is  suggested  beyond  eyeballing  the  results,  but  there  are  certainly 
more  precise  criteria  implicitly  at  work  here,  and  they  should  be  made  explicit. 

I  would  suggest  that  diagnostics  not  be  used  to  reject  the  current  model,  but  rather  to 
guide  the  search  for  better  models  by  indicating  the  direction  of  search,  or  the  way  in 
which  the  current  model  is  inadequate.  If  this  leads  to  the  specification  of  an  alternative 
model,  then  the  current  model  can  be  compared  with  alternative  one  using  the  posterior 
odds  ratio  (or  posterior  expected  utilities  of  these  can  be  specified);  the  current  model  will 
not  be  rejected  unless  the  alternative  one  is  decisively  preferred.  You  don’t  abandon  a 
model  unless  you  have  a  better  one  in  hand. 

Even  viewing  diagnostics  this  way,  as  an  exploratory  tool  rather  than  as  a  basis  for 
inference,  we  still  need  some  yardstick  to  calibrate  our  inspection  of  the  results.  Here  it 
does  seem  that  frequentist  calculations  are  useful,  and  I  suspect  that  such  calculations 
implicitly  underly  the  authors’  interpretation  of  the  results  in  their  Table  2. 

3.  Model  comparison:  In  support  of  the  standard  Bayesian  procedure 
The  standard  Bayesian  procedure  is  given  by  the  authors’  equation  (3),  and  amounts  to 
basing  inference  on  the  posterior  model  probabilities.  They  raise  two  objections  to  this 
procedure,  which  I  will  now  briefly  discuss. 

3.1  "Bartlett’s  paradox" 

This  is  the  observation  due  to  Bartlett  (1957)  that  if  under  Mi  and  Yi  are  i.i.d.  N(0,1),  and 
under  M2  they  are  i.i.d.  N(t>,l)  with  6  N(0,r^),  then  p(Mi|  Y)  -*  1  as  -kd  regardless  of 

the  data;  see  the  authors’  section  2.3.1. 

This  has  been  presented  by  the  authors  and  by  others  that  they  cite  as  a  major  flaw 
of  the  standard  Bayesian  approach,  but  I  do  not  find  it  too  disquieting.  Letting  r  -•  od 
implies  that  E[|  ^|]  also  becomes  arbitrarily  large,  so  it  is  not  too  surprising  that,  for  any 
data  set,  E[|  ^|]  can  be  set  large  enough  that  the  data  prefer  zero.  Some  prior  information 
is  almost  always  available  that  wiU  limit  the  prior  variance  t  ,  and  it  is  always  important 
to  investigate  the  sensitivity  of  p(Mi(Y)  to  changes  in  r^.  In  practice,  p(Mi|Y)  tends  to 
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ry 

be  rather  insensitive  to  changes  in  r  over  a  wide  range  (see,  e.g.,  Raftery,  1988).  Thus, 
Bartlett’s  paradox  seems  to  me  to  suggest  that  the  use  of  highly  diffuse  priors  is  not  a  good 
idea  for  model  comparison. 

It  may  be  objected  that  it  is  desirable  to  have  a  "reference"  procedure  for  model 
comparison.  However,  in  my  applied  experience,  reasonable  proper  priors  are  often  readily 
accepted,  especially  when  backed  up  with  a  serious  sensitivity  analysis;  the  likelihood  is 
often  the  more  controversial  part  of  the  analysis. 

3.2  The  more  serious  criticism 
The  authors  write: 

"Another  criticism  is  that,  in  most  practical  situations,  we  doubt  that  anyone 
including  Bayesians  would  select  models  in  this  fashion.  One  doesn’t  really  believe  that 
any  of  the  proposed  models  are  correct  whence  attaching  a  prior  probability  to  an 
individual  model’s  correctness  seems  siUy.  The  selection  process  is  typically  evolutionary 
with  comparisons  often  made  in  pairs  until  a  satisfactory  choice  (in  terms  of  both 
parsimony  and  performance)  is  made." 

Attaching  a  prior  probability  to  a  model  is  not  any  sillier  than  science  as  traditionally 
practiced.  Most  of  science  is  an  attempt  to  find  a  model  that  predicts  the  observations  to 
date  well;  it  does  not  claim  to  have  found  the  "truth"  (if  such  a  things  exists)  or  the 
"correct  model".  Science  typically  proceeds  by  adopting  a  paradigm,  which  means 
essentially  conditioning  on  a  collection  of  models,  often  with  an  explicit  parametric  form. 
Prior  probabilities  conditional  on  the  adopted  paradigm,  or  collection  of  models,  do  make 
sense. 

Of  course,  if  one  does  not  so  condition,  the  prior  probability,  and  hence  also  the 
posterior  probability  of  most  models  is  zero.  Since  one  does  not  believe  the  paradigm  to  be 
the  "truth",  this  may  make  science  as  a  whole  seem  silly,  but  its  record  of  success  argues  in 
its  favor.  Note  that  the  marginal  likelihood,  f(YlMj),  which  is  proportional  to  the 
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posterior  probability  of  Mj,  is  just  the  (predictive)  probability  of  the  data  given  the  model 
Mj,  and  so  is  precisely  the  right  quantity  for  evaluating  the  scientific  theory  defined  by  Mj. 

Consider,  for  example,  the  question  of  whether  smoking  causes  lung  cancer,  and 
suppose  that  the  currently  accepted  way  of  addressing  this  issue  is  within  the  framework  of 
the  logistic  regression  model,  logit(Pr[lung  cancer])  =  7l[smokes]  +  where  x  is  a 
vector  of  control  variables.  Conditionally  on  this  framework  (or  "paradigm"),  the  issue 
becomes  a  comparison  of  the  two  models  Mi  :  7  =  0  and  M2  :  7  >  0.  Then  a  scientist’s 
natural  language  statement  "I  am  90%  sure  that  smoking  causes  lung  cancer"  is  equivalent, 
given  the  framework,  to  the  statement  that  p(Mi)  =  0.1  and  p(M2)  =  0.9.  This  does  seem 
to  make  sense  even  if,  unconditionally  on  the  framework,  p(Mi)  =  p(M2)  =  0. 

Of  course,  the  natural  language  statement  itself  can  be  viewed  as  not  being  about 
"truth",  but  rather  about  future  data  and  trends  in  scientific  opinion.  It  might  mean,  for 
example,  "I  am  90%  sure  that  future  data  will  be  better  predicted  by  M2  than  by  Mi",  or 
"I  am  90%  sure  that  within  T  years  the  belief  that  smoking  causes  lung  cancer  will  be 
generally  accepted";  note  that  the  latter  two  statements  can  be  given  standard  betting 
interpretations.  For  an  ex^ple  where  scientists  might  attach  substantial  prior  probability 
to  the  smaller  ("null")  model,  consider  cold  fusion. 

The  authors  describe  the  standard  Bayesian  procedure  as  a  model  selection  procedure, 
but  it  is  considerably  richer  than  that.  When  comparing  two  models  that  genuinely 
represent  rival  scientific  hypotheses,  the  posterior  odds  ratio  provides  a  summary  of  the 
evidence  for  one  model  against  the  other;  unless  the  evidence  is  very  strong,  one  model  will 
not  necessarily  be  selected. 

Often,  however,  model  form  is  not  the  object  of  primary  scientific  interest.  The 
authors  did  not  say  what  the  main  scientific  question  was  in  their  growth  curve  example, 
but  I  suspect  that  it  was  not  the  choice  between  the  two  models  that  they  considered.  If 
interest  focuses  instead  on  some  other  quantity,  a,  such  as  the  next  observation,  Yk,  or  the 
asymptote,  0^,  then  model  selection  is  a  false  problem,  and  it  is  important  to  take  account 
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of  model  uncertainty.  The  Bayesian  approach  provides  an  immediate  way  of  doing  this 
using  the  equation. 

J 

p(a|Y)=  S  p(AlY,Mj)p(Mi|Y).  (1) 

j=l 

Hodges  (1987)  emphasized  the  importance  of  taking  account  of  modd  uncertainty,  pointing 
out  that  failure  to  do  so  leads  to  the  overall  uncertainty  bdng  underestimated,  and  hence, 
for  example,  to  overly  risky  decisions. 

If  the  posterior  probability  of  one  of  the  models  is  dose  to  unity,  or  if  the  posterior 
distribution  of  a  is  almost  the  same  for  the  models  that  account  for  most  of  the  posterior 
probability,  then  p(A  |  Y)  may  be  approximated  by  conditioning  on  a  single  modd,  namdy 
by  p(a|  Y,Mi)  for  some  i.  This  seems  to  be  the  main  situation  in  which  model  selection,  as 
such,  is  a  valid  exerdse.  The  "evolutionary"  process  to  which  the  authors  refer  is  in  reality 
an  informal  search  method  for  finding  the  main  modds  that  contribute  to  the  sum  in 
equation  (1),  and  in  this  sense  may  be  viewed  as  an  approximation  to  the  full  (standard) 
Bayesian  procedure.  Clearer  recognition  of  this  might  lead  to  more  satisfactory  modd 
search  strategies. 

4.  The  standard  Bayesian  procedure  and  sampling— based  methods. 

The  key  quantity  for  the  implementation  of  the  standard  Bayesian  procedure  is  the 
marginal  likelihood,  f(Y|Mj)  =  J f(Y|  ^j,X,Mj)7r(5j)d9j.  The  authors  say  that  the  Gibbs 
sampler  does  not  readily  produce  an  estimator  of  f(Y)Mj).  However,  Newton  and  Raftery 
(1991)  have  recently  pointed  out  the  existence  of  a  simple  and  general  such  estimator. 
They  show  that,  given  a  sample  from  the  posterior,  the  marginal  likelihood  may  be 
{simulation— consistently)  estimated  by  the  harmonic  mean  of  the  associated  likelihood 
values.  This  result  applies  no  matter  how  the  sample  was  obtained,  whether  directly  using 
the  analytic  form  of  the  posterior,  by  importance  sampling,  the  Gibbs  sampler,  the  SIR 
algorithm  or  the  weighted  likelihood  bootstrap.  There  can  be  stability  problems  with  this 
estimator,  and  slight  modifications  that  avoid  these  are  discussed  in  the  dted  reference. 
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The  standard  Bayesian  procedure  is  a  predictive  approach  since  the  marginal 
likelihood  can  be  written 

f(Y|Mi)  =  D  (2) 

r=l 

where  =  (Yi,*  •  -  lYr-i)-  Note  that  the  conditional  densities  on  the  right^and  side  of 
equation  (2)  are  conditional  on  the  first  (r-1)  observations,  and  not  on  all  the  other  (n-1) 
observations.  Thus  the  standard  Bayesian  procedure  is  a  "prequential"  method  in  the 
sense  of  Dawid  (1984),  and  not  a  cross-validation  approach.  Each  conditional  density  on 
the  right-hand  side  of  equation  (2)  may  be  evaluated  in  a  sampling-based  way,  using  the 
same  methods  as  the  authors  propose  for  their  d4r-  It  follows  that  this  provides  an 
alternative  sampling— based  way  of  calculating  the  marginal  likelihood,  and  hence  of 
implementing  the  standard  Bayesian  procedure. 

Note  also  that  equation  (2)  remains  valid  even  if  the  observations  are  permuted. 
Thus,  even  if  the  model  does  not  impose  a  natural  ordering  on  the  observations, 
"prequential  diagnostics"  may  be  obtained  by  sampling  from  the  set  of  all  permutations  of 
the  observations  and  averaging  over  diagnostics  based  on  the  conditional  densities  on  the 
right-hand  of  equation  (2). 

If  one  replaces  the  conditional  densities  on  the  right-hand  side  of  equation  (2)  by 
densities  conditional  on  all  the  observations  except  the  rth  one,  one  obtains  the  quantity 
that  the  authors  denote  by  D4  =  Il“^jd4r.  This  could  be  called  a  "pseudo-marginal 
likelihood",  by  analogy  with  the  pseudo-likelihood  concept  introduced  by  Besag  (1975). 
Using  D4  rather  than  f(Y|Mj)  is  similar  to  using  the  pseudo-likelihood  rather  than  the 
likelihood  when  the  latter  is  available,  which  does  not  seem  to  be  a  very  good  choice.  As 
an  argument  in  favor  of  D4,  however,  the  authors  point  out  that  with  improper  priors  D4  is 
defined  whereas  f(Y|Mj)  is  not.  This  strikes  me  as  a  disadvantage  of  improper  priors 
rather  than  of  the  standard  marginal  likelihood. 

I  will  attempt  to  summarize  the  various  analogies  and  equivalences  discussed  in  the 
following  table. 


28 


Prequential  analysis 

Cross-validation 

Likelihood 

Pseudo-likelihood 

Marginal  likelihood 

(faiHi)) 

"Pseudo-marginal 
likelihood"  (D4) 

Posterior  model  probability/ 
Bayes  factor 

Fixed-level  significance 
test 

BIC  (Schwarz,  1978) 

AIC,  Cp 

Entries  in  the  same  column  are  regarded  as  being  related,  either  by  being  motivated 
by  the  same  approach  or  by  being  asymptotically  equivalent.  Entries  in  the  same  row  are 
viewed  as  different  approaches  to  the  same  task  or  concept.  I  prefer  the  entries  in  the 
left-hand  column,  headed  "prequential  analysis",  while  the  authors  seem  to  incline  to  the 
entries  in  the  right-hand  column.  Note  that  the  difference  can  be  important,  especially 
with  large  samples. 
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Daniel  Pena  (Univeisidad  Carlos  HI  de  Madrid) 

This  paper  contains  three  features  that  I  really  like:  (1)  it  stresses  the  importance  of  model 
determination  and  model  diagnosis  in  statistical  analysis;  (2)  it  advocates  a 
cross— validatory  assessment  of  the  model  using  the  predictive  distribution;  (3)  it  points  out 
the  difficulties  of  using  a  naive  Bayesian  analysis  for  Model  choice. 

Model  diagnosis  has  received  an  increasing  interest  in  the  Bayesian  literature  and  I 
would  like  to  add  the  works  by  Zellner  (19T5),  Johnson  and  Geissei  (1985)  and  GuUman 
and  Pena  (1988)  to  the  references  given  in  the  paper. 

The  information  about  the  model  adequacy  is  contained  in  the  joint  predictive 
distribution  f(y)  and  a  key  problem  is  to  devise  simple  procedures  to  reveal  this 
information.  A  sensible  first  step  is  the  one  used  in  this  paper  as 

f(y)  =  f(yi|y(i))f(y{i)) 

we  can  look  at  the  cross-validatory  predictive  distribution  f(yi|y(i))  that  is 
unidimensional.  However,  this  procedure,  although  very  useful,  does  not  show  some  of  the 
interesting  multivariate  features  of  the  data,  for  instance,  sets  of  similar  points  that  are 
different  from  the  rest  of  the  data  and  which  cannot  be  identified  by  univariate  analysis 
because  of  masking.  Also  a  set  of  outliers  can  produce  that  some  other  good  points  appear 
as  outlying,  and  this  situation  has  been  called  swamping  in  the  statistical  literature.  To 
avoid  these  problems,  we  need  to  consider  either  the  whole  predictive  density  or  the 
distributions  f(yj  | y ( j,)  and  f(y ,  j j)i  where  yj  is  a  subset  of  observations.  Of  course,  looking 

at  all  the  possible  decompositions  of  the  data  is  an  impossible  task  and  we  need  to  develop 
procedures  to  search  for  interesting  combinations  of  points.  My  joint  work  with  George 
Tiao  in  this  volume  may  be  a  first  step  on  this  direction. 

As  far  as  the  computation  of  f(yj|y,j,)  is  concerned  it  should  be  pointed  out  that  the 

easiest  way  to  understand  its  structure  is  to  use: 


(1) 


30 


instead  of  f(y)/f(y,[))-  The  reason  is  that  (1)  is  similar  to  the  standard  marginalization  to 

compute  the  predictive,  and  therefore,  standard  techniques  can  be  applied  to  obtain  the 
distribution  in  a  compact  way.  For  instance,  if  y  ~  )  with  a  known  and 

fi  -v  N(/iQ,ao),  straightforward  to  show  that 


ly, ,))  =  {- 


exp 


^  (yi_^ 


where 


and 


=  ((n-I)a-2  + 

Therefore,  the  cross-validation  predictive  density  for  the  subset  I  depends  on  the  ratio 
Sj/a^  =  S.^j(yi-yj)^/(I(7^)  that  is  a  key  factor  in  the  analysis  of  this  subset. 
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L.R.  Pericchi  (Simon  Bolivar  Universidad,  Caracas) 

It  would  be  a  promising  theoretical  exerdse  to  investigate  the  relationship  between 
your  interesting  suggestions  for  selecting  a  model  and  the  dimension  of  the  model.  As  an 
extreme  case,  one  may  think  of  a  model  that  encompasses  both  models  1  and  2  in  your 
illustrative  example,  and  then  compare  with  models  1  and  2.  Which  model  would  your 
suggestions  select? 
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L.I.  Pettit  (Goldsmiths’  College,  London) 

I  would  firstly  like  to  comment  on  the  measure  of  model  adequacy  discussed  in 
section  2.2  and  fill  in  some  of  their  history. 

The  possibility  of  using  dir  and  other  similar  ‘Bayesian  residuals’  was  discussed  by 
Pettit  (1986)  and  Geisser  (1987).  Chaloner  and  Brant  (1988)  suggest  a  different  definition 
of  Bayesian  residuals  using  an  idea  going  back  to  Zellner  (1975).  Geisser  (1987)  also 
discusses  the  use  of  dsr  which  he  describes  as  a  discordancy  ranking.  The  quantity  d4r, 
usually  called  the  conditional  predictive  ordinate  (CPO),  was  proposed  by  Geisser  (1980) 
and  used  by  Pettit  and  Smith  (1983,  1985)  and  Pettit  (1988)  as  a  tool  in  outlier  modelling. 
Pettit  (1990)  presents  a  number  of  results  about  the  CPO  for  the  normal  distribution.  The 
quantity  d^^  is  called  the  ’^atio  ordinate  measure  by  Pettit  (1990).  I  think  the  idea  of 
comparing  a  predictive  distribution  to  its  mode  goes  back  to  Roberts  (1965). 

As  far  a.'  model  choice  goes,  I  have  found  the  use  of  Bayes  factors,  which  do  not 
require  a  prior  probability  of  an  individual  model’s  correctness  (§2.3),  to  be  very  useful. 
The  approach  of  Spiegelhalter  and  Smith  (1982)  to  the  problem  of  improper  priors  is 
important.  Measuring  the  effect  of  individual  observations  on  Bayes  factors  (Pettit  and 
Young,  1990)  leads  to  an  expression  which  is  the  difference  in  logarithms  of  the  CPO’s  for 
the  two  models  and  so  ties  in  with  the  model  adequacy  ideas. 

The  computational  methods  discussed  in  this  paper  will  be  very  useful  in  calculating 
ail  these  quantities  and  it  is  therefore  for  me  a  very  welcome  paper. 
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FranQoise  Seillier— Moisdwitsch  (University  of  North  Carolina.,  Chapel  Hill) 

The  fundamental,  and  welcomed,  stance  of  this  paper  is  the  shift  of  emphasis,  in 
model  determination,  from  parameters  to  observables:  goodness-of— fit  criteria  are 
abandoned  in  favour  of  an  assessment  based  on  the  model’s  ability  to  produce  decent 
predictions.  The  selected  model  will  indeed  often  be  used  on  new  observables. 

Several  questions  arise  regarding  the  checking  functions  the  authors  adopted.  Which 
one  did  they  find  most  useful?  In  particular,  gi  focuses  on  a  single  characteristic  of  the 
predictive  distribution  whereas  gs  and  g^  take  into  account  the  whole  distribution  function. 
Situations  where  the  former  is  more  informative  are  likely  to  be  few.  For  model 
comparison,  have  they  found  the  difference  in  logarithmic  scores  more  revealing  than 
looking  at  the  difference  in  other  scores? 

Scoring  rules  can  also  be  of  use  in  checking  the  adequacy  of  a  single  model.  The 
results  of  Seillier— Moiseiwitsch  &  Dawid  (1990),  developed  for  discrete  outcomes  and 
bounded  scoring  rules,  can  be  adapted  for  continuous  variables.  These  results  assume  a 
natural  ordering  in  the  realizations.  The  probability  range  can  be  partitioned  into  a  fixed 
number  of  bins  and  the  probabilities,  under  the  predictive  distribution,  that  the  observable 
falls  in  each  of  these  bins  can  be  compared  to  the  indicators  of  the  realized  Bernoulli 
process.  The  score  constructed  over  a  number  of  outcomes,  once  normalized,  behaves 
asymptotically  like  a  N(0,1)  random  variable.  The  normalization  is  carried  out  with 
respect  to  the  predictive  distribution  at  each  point. 

The  authors  mention  the  difficulty  in  drawing  formal  inference  from  {dr}  due  to  the 
strong  dependency  among  these.  Transformations  that  yield  independent  random  variables 
could  be  considered,  for  instance,  by  conditioning  on  sufficient  statistics  in  the  probability 
integral  transform,  one  is  provided  with  a  set  of  i.i.d.  residuals  (O’Reilly  «k  Quesenberry, 
1973).  Let  Fn(y)  =  F(y|Tn)  where  Tn  is  a  minimal  sufficient  statistic,  {Fn(Yi)}  are  i.i.d. 
uniforms  on  [0,1].  Furthermore,  Fn(Yi),  Fn(Yj|  Yi),-  •  •,Fn(Yjj|  Yi,-  •  •  .Y^^)  also  generate 
a  set  of  a  i.i.d.  U[0,1]  random  variables,  where  a  is  the  number  of  components  in  the  vector 
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of  minimally  sufEdent  statistics.  This  conditional  transform  fits  particularly  well  in  a 
sequential  sampling  framework  (Seillier— Moisei witsch,  1990).  Indeed,  if  Tn  is  doubly 
transitive  and  adequate,  then  Fn-ci+i(Yj,.Q+j),' •  •,Fn(Yn)  have  the  same  distributional 
properties.  If  no  natural  ordering  exists,  the  transform  should  be  applied  to  the  ordering 
sample  (O’Reilly  &  Stephens,  1982). 
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Reply  to  the  discussion: 

We  thank  the  discussants  for  their  kind  and  generally  positive  remarks.  We  knew 
that  our  reference  list  for  this  active  research  area  was  very  incomplete  and  appreciate  the 
additional  citations  provided  in  the  discussion.  Pettit’s  historical  perspective  is  a 
particularly  welcome  supplement. 

Several  discussants  comment  upon  the  close  relationship  between  the  model 
determination  problem  and  the  issues  of  diagnosing  and  modeling  outliers.  Also  see  Draper 
and  Guttman  (1987).  We  note  that  sampling— based  approaches  expedite  calculations 
associated  with  these  issues.  See  for  instance,  the  recent  paper  of  Verdinelli  and 
Wasserman  (1991).  Pena  encourages  us  to  investigate  cross-validation  schemes  other  than 
single  point  deletion  in  particular  with  regard  to  identifying  masking  and  swamping.  He 
suggests  that  f(YjlY(jj)  be  computed.  We  concur  noting  that  the  methodology  in  section 
3  is  pertinent  to  such  computation.  Our  only  reservation  involves  possible  combinatoric 
problems  as  indicated  in  Pena  and  Tiao  (1991). 

Pericchi  raises  an  interesting  question  which  does  not  appear  to  have  a  simple  answer. 
The  difficulty  is  that,  in  general,  it  is  not  obvious  what  the  model  which  "encompasses 
both  models  1  and  2"  is.  In  customary  linear  models  it  is  clear;  we  merely  augment  the 
design  matrix  to  do  this.  However  consider  the  two  nonlinear  models  discussed  in  section  4 

X 

i.e.  model  1:  Y  =  ^3^(1  +  +  e,  model  2:  Y  =  The  encompassing 

model  which  is  additive  in  the  mean  structure  will  not  be  identifiable;  the  asymptote  is 
+  7q.  If  we  remedy  this  by  setting  =  7^,  we  can  no  longer  retrieve  model  1  or  model  2  as 
a  reduced  model.  Suppose  we  try  a  multiplicative  form  for  the  encompassing  model 

X 

Y  =  /?q(1+^j^2^)  *  e~^l '2  4-  £.  Now  the  reduced  models  are  not  identifiable;  /3i  =  0  or 
07  =  0  produces  model  2,  71  =  0  or  72  =  0  produces  model  1. 

Turning  to  the  remarks  of  Seillier— Moiseiwitsch  we  agree  that  the  checking  function 
g,  may  be  less  informative  than  the  others.  Nonetheless  examination  of  residuals  is 
standard  and  familiar.  Moreover,  the  resulting  dj  ^  have  an  immediate  connection  with 
Bayesian  residuals  as  discussed  in  Chaloner  and  Brant  (1988).  Th^  consider  the  posterior 
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distribution  of  the  unobserved  errors  which,  in  our  setting,  leads  to  the  distribution  of 
Cr  I Y  j  r )  =  y  r  -  <f{X.T,  0)\'fir)-  The  mean  of  this  distribution  is  d  ir. 

Her  suggestion  to  transform  the  {dir;  r=l,  •  •  • ,  n}  to  a  set  of  i.i.d.  U(0,1)  variates  is 
interesting  but  we  suspect  feasible  only  in  certain  simple  cases.  That  is,  preliminary 
reading  of  O’Reilly  and  Quesenberry  (1973)  yields  several  concerns.  Their  approach 
requires  the  joint  predictive  distribution,  f(Y),  to  be  proper,  requires  an  explicit  expression 
for  f(Y)  and  in  fact,  appears  to  require  that  f(Y)  be  an  exponential  family  to  effectively 
bring  (minimal)  sufficiency  into  play.  A  separate  problem  is  that,  even  were  we  able  to 
carry  out  the  calculations,  we  worry  about  the  inherent  order  dependence  of  the  results 
since  for  general  response  model  data  no  natural  ordering  need  exist. 

Finally,  Rafttty  offers  the  lengthiest  and  most  penetrating  discussion.  One  of  his 
main  points  concerns  the  computation  of  f(Y).  We  agree  that  this  can  be  done  and,  in  fact, 
at  the  end  of  section  3  mention  the  use  of  importance  sampling  densities  to  do  so.  Whether 
the  posterior  is  a  good  choice  is  unclear  since  the  resulting  harmonic  mean  estimator  may 
be  unstable.  Calculation  of  f(Y)  in  a  sequential  fashion  seems  silly.  In  most  cases  the 
effort  to  compute  f(Y)  directly  would  not  be  much  greater  than  that  required  to  compute 
an  individual  term  in  the  factorization.  We  also  note  the  aforementioned  concern  regarding 
the  inherent  order  dependence  which  is  not  mitigated  computationally  by  the  suggestion  to 
randomly  sample  permutations. 

More  importantly,  we  criticized  the  use  of  f(Y)  when  it  is  not  integrable  and  not 
because  we  couldn’t  compute  it.  We  completely  agree  that  the  choice  of  likelihood  is  the 
critical  problem  and  in  fact  say  so  in  the  introduction.  We  are  less  sanguine  about  the 
availability  of  proper  priors.  If  they  are  developed  through  training  data  (imaginary  or 
otherwise)  is  this  not  really  similar  in  spirit  to  cross-validation? 

Turning  to  our  criticism  of  the  standard  Bayesian  model  choice  procedure  there  are 
no  doubt  situations  where  we  may  knowledgeably  assign  prior  weights  to  models  in  which 
case  we  would  certainly  do  so  and  obtain  posterior  odds.  But  "garden  variety" 
specification  of  the  likelihood  with  r^aid  to  features  discussed  in  our  introduction  doesn’t 
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seem  to  readily  lend  itself  to  such  weighting.  However  we  thoroughly  agree  that  Bayes 
factors  (when  interpretable)  or  pseudo— Bayes  factors  are  vital  objects  to  compute  in 
comparing  models.  Still  these  factors  may  disadvantage  some  models  relative  to  others. 
Hence  we  value  the  information  obtained  through  other  checking  functions.  A  question 
requiring  further  analytic  and  empirical  elaboration  is,  in  the  case  of  proper  priors,  how 
different  will  the  Bayes  factor  and  the  pseudo— Bayes  factor  be  particularly  as  n  increases? 

In  conclusion  we  are  invigorated  by  all  of  the  discussion,  critical  or  otherwise.  Model 
determination  is  obviously  a  fundamental  data  analytic  task.  Illumination  of  its  aspects  in 
the  Bayesian  framework,  particularly  contentious  ones,  necessarily  enhances  our 
understanding  of  the  task. 


Additional  references: 

Draper,  N.R.  and  Guttman,  I.  (1987).  A  common  model  selection  criterion.  In: 
Probability  and  Bayesian  Statistics,  (R.  Viertl,  ed.)  Plenum  Publishing  Corp., 
Innsbruck,  Austria,  p.  134—150. 

Pena,  D.  and  Tiao,  G.C.  (1991).  Bayesian  outliers  functions  for  linear  models.  In: 
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